The interactive word picker is at the bottom of this notebook!
homework 3, version 0
Submission by: Jazzy Doe (jazz@mit.edu)
Homework 3: Structure and Language
18.S191, fall 2020
This notebook contains built-in, live answer checks! In some exercises you will see a coloured box, which runs a test case on your code, and provides feedback based on the result. Simply edit the code, run it, and the check runs again.
For MIT students: there will also be some additional (secret) test cases that will be run as part of the grading process, and we will look at your notebook and write comments.
Feel free to ask questions!
"Jazzy Doe"
"jazz"
Let's create a package environment:
Activating new project at `/tmp/jl_7TwWoc`
Updating registry, resolving package versions... installed Colors v0.12.11, Compose v0.9.5, and PlutoUI v0.7.61 (plus their dependencies), then precompiled the project.
Exercise 1: Language detection
In this exercise, we are going to create some super simple Artificial Intelligence. Natural language can be quite messy, but hidden in this mess is structure, which we are going to look for today.
Let's start with some obvious structure in English text: the set of characters that we write the language in. If we generate random text by sampling random Unicode characters, it does not look like English:
"\Ua0c7b𥝩\U6457d\U7c419\Ud813f\Udaba1\Ue4e6e𘑽\Ud095a\U10f259\U9a3d5\U51385\Ue785b\Uedade\U7dce1𤻽\U105aee\Ua70e2\U3877c\U94e42𧹼𭿭\U36658ﲽ\U8a0a7\Ub35ac\Uc99b8\U104281\U10d99\Ua1de2\ufd49\Ue1240\Uae70d\Uc90dc\U1ed5e\U699cc\Ud7426\Ub2f4d\U93dad\Uc082e"
Instead, let's define an alphabet, and only use those letters to sample from. To keep things simple, we ignore punctuation, capitalization, etc., and only use these 27 characters:
'a'
'b'
'c'
'd'
'e'
'f'
'g'
'h'
'i'
'j'
'k'
'l'
'm'
'n'
'o'
'p'
'q'
'r'
's'
't'
'u'
'v'
'w'
'x'
'y'
'z'
' '
Let's sample random characters from our alphabet:
wimqftzjinlktsjtzrrdqfkvvozfivmhiehbtgay
That already looks a lot better than our first attempt! But still, this does not look like English text - we can do better.
English words are not well modelled by this random-Latin-characters model. Our first observation is that some letters are more common than others. To put this observation into practice, we would like to have the frequency table of the Latin alphabet. We could search for it online, but it is actually very simple to calculate ourselves! The only thing we need is a representative sample of English text.
The following samples are from Wikipedia, but feel free to type in your own sample! You can also enter a sample of a different language, if that language can be expressed in the Latin alphabet.
Remember that the button on the left of a cell will show or hide the code.
We also include a sample of Spanish, we'll use it later!
"\tAlthough the word forest is commonly used, there is no universally recognised precise definition, with more than 800 definitions of forest used around the world.[4] Although a forest is usually defined by the presence of trees, under many definitions an area completely lacking trees may still be c" ⋯ 861 bytes ⋯ "d(land)\" (confer the English sylva and sylvan); confer the Italian, Spanish, and Portuguese selva; the Romanian silvă; and the Old French selve, and cognates in Romance languages, e. g. the Italian foresta, Spanish and Portuguese floresta, etc., are all ultimately derivations of the French word. \n"
"Un bosque es un ecosistema donde la vegetacion predominante la constituyen los arboles y matas.1\u200b Estas comunidades de plantas cubren grandes areas del globo terraqueo y funcionan como habitats para los animales, moduladores de flujos hidrologicos y conservadores del suelo, constituyendo uno de lo" ⋯ 1294 bytes ⋯ "os sistemas de raices y como detritos de plantas parcialmente descompuestos. El componente lenoso de un bosque contiene lignina, cuya descomposicion es relativamente lenta comparado con otros materiales organicos como la celulosa y otros carbohidratos. Los bosques son areas naturales y silvestre \n"
Exercise 1.1 - Cleaning data
Looking at the sample, we see that it is quite messy - it contains punctuation, accented letters and numbers. For our analysis, we are only interested in our 27-character alphabet (i.e. 'a' through 'z', plus ' '). We are going to clean the data using the Julia function filter.
7
9
-5
filter takes two arguments: a function and a collection. The function is applied to each element of the collection, and it returns either true or false. If the result is true, then that element ends up in the final collection.
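For example, the three values shown above (7, 9, -5) could be produced by a call like this sketch (the input array is our assumption):

filter(isodd, [6, 7, 8, 9, -5])   # == [7, 9, -5] - only the odd numbers survive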
Did you notice something cool? Functions are also just objects in Julia, and you can use them as arguments to other functions! (Fons thinks this is super cool.)
We have written a function isinalphabet, which takes a character, and returns a boolean:
isinalphabet (generic function with 1 method)
true
false
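A minimal sketch of what such a function might look like, assuming the alphabet vector defined above (the test characters are our assumption):

isinalphabet(character) = character in alphabet

isinalphabet('a')   # true
isinalphabet('+')   # false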
👉 Use filter to extract just the characters from our alphabet out of messy_sentence.
"#wow 2020 ¥500 (blingbling!)"
missing
Here we go!
Replace missing with your answer.
We are not interested in the case of letters ('A' vs 'a'), so we want to map these to lowercase before we apply our filter. If we don't, all uppercase letters get deleted.
👉 Use the function lowercase to convert messy_sentence_2 into a lowercase string, and then use filter to extract only the characters from our alphabet.
"Awesome! 😍"
missing
Here we go!
Replace missing with your answer.
Finally, we need to deal with accents - simply deleting accented characters from the source text might deform it too much. We can add accented letters to our alphabet, but a simpler solution is to replace accented letters with the unaccented base character. We have written a function unaccent that does just that.
"Égalité!"
"Egalite!"
Turn "áéíóúüñ asdf"
into "aeiouun asdf"
.
👉 Let's put everything together. Write a function clean that takes a string, and returns a cleaned version, where:
accented letters replaced by their base characters
uppercase converted to lowercase
filtered to only contain characters from alphabet
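A minimal sketch that chains the three steps, assuming the isinalphabet and unaccent functions from above (the test sentence is our reconstruction of the input behind the result below):

function clean(text)
    filter(isinalphabet, lowercase(unaccent(text)))
end

clean("Crème brûlée est mon plat préféré.")   # "creme brulee est mon plat prefere"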
clean (generic function with 1 method)
"creme brulee est mon plat prefere"
Got it!
Keep it up!
Exercise 1.2 - Letter frequencies
We are going to count the frequency of each letter in this sample, after applying your clean function. Can you guess which character is most frequent?
"although the word forest is commonly used there is no universally recognised precise definition with more than definitions of forest used around the world although a forest is usually defined by the presence of trees under many definitions an area completely lacking trees may still be considered a" ⋯ 781 bytes ⋯ "enoted forest and woodland confer the english sylva and sylvan confer the italian spanish and portuguese selva the romanian silv and the old french selve and cognates in romance languages e g the italian foresta spanish and portuguese floresta etc are all ultimately derivations of the french word "
letter_frequencies (generic function with 1 method)
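A sketch of what letter_frequencies might look like: count how often each alphabet letter occurs, then divide by the total so that the frequencies sum to 1.

function letter_frequencies(txt)
    counts = [count(isequal(c), txt) for c in alphabet]
    counts ./ sum(counts)
end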
0.0624093
0.00580552
0.0203193
0.0406386
0.106676
0.0319303
0.0261248
0.0341074
0.0602322
0.0
0.00217707
0.0406386
0.0123367
0.0667634
0.0667634
0.0108853
0.0
0.0580552
0.0609579
0.0682148
0.0188679
0.0123367
0.0145138
0.000725689
0.0130624
0.0
0.165457
The result is a 27-element array, with values between 0.0 and 1.0. These values correspond to the frequency of each letter: sample_freqs[i] == 0.0 means that the i-th letter did not occur in the sample, and sample_freqs[i] == 0.1 means that 10% of the letters in the sample are the i-th letter.
To make it easier to convert between a character from the alphabet and its index, we have the following function:
index_of_letter (generic function with 1 method)
1
2
27
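A one-line sketch of such a function, assuming alphabet is a Vector{Char}:

index_of_letter(letter) = findfirst(isequal(letter), alphabet)

index_of_letter('a')   # 1
index_of_letter(' ')   # 27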
👉 Which letters from the alphabet did not occur in the sample?
'a'
'b'
Keep working on it!
The answer is not quite right.
Hint
You can answer this question without writing any code: have a look at the values of sample_freqs.
Now that we know the frequencies of letters in English, we can generate random text that already looks closer to English!
Random letters from the alphabet:
hnocqu ssqlhydfxlhvootfbouhmqristwvgrrzaiohgxjknobxuzaswlrknyacxterh lrpghgxnwxjncinkx fkaocyy jjwyxcuxaxqvzwipxmfxbcsimqmancjniycrghenablofof slpguoixtgcytmxnmwrnrwuwllqhqgjxpbvpmdkvxfmly nik weveswhajluexyajwbkjvbfeah gdnhaksashoeniljmdiqrumipklwnwfm sywyrqqiakekddldqulfgrjoxedgairmb ktuqideqvchuzgcwpqzonhysbsdncyunktkraqseviylmyjqftncmkxjnveuhuvdopkyrnpqraqapqcdbtwugchetl gczhw f pmgn izqqrvxg
Random letters at the correct frequencies:
vi aoei tlgir ncvssea epet v ihlife s hii defawsrrra tieeoa ctlw dseeseyt retttsgth euet nao eotweevrl ufdtti r ef t p snmeihgvnn a i soagsnfdgxssloikfontlmwa sf golamfih lv ehr ndiacotn forneuicsdn h l eeihifhneishsllleucvw rr ro har edardoateoeddh e uexdre htgnhttslg fine ire grecb dcsa rw lg sgy t sphmsl wlnrlnealur epgttdskrs tsdiyrn nitgfaareaistsst oocilettls tsala ese daf bece sn
By considering the frequencies of letters in English, we see that our model is already a lot better!
Our next observation is that some letter combinations are more common than others. Our current model thinks that potato is just as 'English' as ooaptt. In the next section, we will quantify these transition frequencies, and use them to improve our model.
rand_sample (generic function with 1 method)
rand_sample_letter (generic function with 1 method)
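These two helpers are defined by the notebook; a sketch of one possible implementation, assuming the frequency vector sums to 1:

# draw a random index i with probability frequencies[i]
function rand_sample(frequencies)
    x = rand()
    findfirst(f -> f >= x, cumsum(frequencies))
end

# draw a random letter according to the given letter frequencies
rand_sample_letter(frequencies) = alphabet[rand_sample(frequencies)]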
Exercise 1.3 - Transition frequencies
In the previous exercise we computed the frequency of each letter in the sample by counting their occurrences, and then dividing by the total number of counts.
In this exercise, we are going to count letter transitions, such as aa, as, rt, yy. Two letters might both be common, like a and e, but their combination, ae, is uncommon in English.
To quantify this observation, we will do the same as in our last exercise: we count occurrences in a sample text, to create the transition frequency matrix.
transition_counts (generic function with 1 method)
normalize_array (generic function with 1 method)
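A sketch of these two steps, using the matrix convention described below (column = first character, row = second character):

function transition_counts(cleaned_sample)
    counts = zeros(Int, length(alphabet), length(alphabet))
    chars = collect(cleaned_sample)
    for i in 1:length(chars)-1
        # row indexes the second character, column the first
        counts[index_of_letter(chars[i+1]), index_of_letter(chars[i])] += 1
    end
    counts
end

normalize_array(xs) = xs ./ sum(xs)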
27×27 Matrix{Float64}:
0.0 0.000726216 0.000726216 0.0 … 0.000726216 0.0 0.00944081
0.000726216 0.0 0.0 0.0 0.00145243 0.0 0.0
0.00217865 0.0 0.0 0.0 0.0 0.0 0.00145243
0.0 0.0 0.0 0.0 0.0 0.0 0.0232389
0.000726216 0.0 0.00290487 0.0087146 0.0 0.0 0.0312273
0.0 0.0 0.0 0.0 … 0.0 0.0 0.00653595
0.00145243 0.0 0.0 0.0 0.0 0.0 0.00798838
⋮ ⋱ ⋮
0.00508351 0.0 0.0 0.0 0.0 0.0 0.000726216
0.00217865 0.0 0.0 0.0 0.0 0.0 0.00145243
0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.000726216 0.0 0.0 0.0 0.0 0.0 0.010167
0.0 0.0 0.0 0.0 … 0.0 0.0 0.0
0.0145243 0.00290487 0.00726216 0.010167 0.0 0.0 0.000726216
What we get is a 27 by 27 matrix. Each entry corresponds to a character pair. The column corresponds to the first character, the row to the second character. Let's visualize this:
Answer the following questions with respect to the cleaned English sample text, which we called first_sample. Let's also give the transition matrix a name:
👉 What is the frequency of the combination "th"?
missing
👉 What about "ht"?
missing
Here we go!
Replace missing with your answer.
👉 Which letters appeared doubled in our sample?
'x'
'y'
👉 Which letter is most likely to follow a W?
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
👉 Which letter is most likely to precede a W?
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
👉 What is the sum of each row? What is the sum of each column? How can we interpret these values?
We can use the measured transition frequencies to generate text that has the same transition frequencies as our original sample. Our generated text is starting to look like real language!
Random letters from the alphabet:
lxzmih ikwnfkd tomsbgzncmvvybvnwutgyic xqskkfx zpayhwhccgxngelaxwxkqyxcpqdfvjxrgkhbbetgcafunstlmitqvlmssaohyom kliqsorpfwyxzwurwbptqzlf uyzrsmwomskeaeggpgyqltsjsadunzyrshbhfuxuqkqfvihgabg ndiwmssneynazodscobuxnhhcsgiaxsnkwxflpwbxavldrdysjvumuxowylhijkfgrowjbvvqvwcpvhediabuwmakjdqulbpvpsqpkom ewylvzvilmdqmdipgewojeucezwarioxtwlgmryrgpj gwcrmggajnxkosaygwfzbachagkxhgdwsxchrzkolnlq weioz rimubvzhdspp
Random letters at the correct frequencies:
ihsonga isagn m nrroslnetuaa rnme ciwchn fsg iosgtnda asnr oedleywah fmitpkc s h nselrn e esllli orhthgn rdti ac roeiilo eid csft d maelanwti f u twerdhi vi ro niotenolobitallae idisnanotn lelrianierersf nhaahateoaegey eannhttefm s nsnncnninoaylsln sihrmlo trhoecsroatrsb r ovceotdtft aaslthsvihc geto rsnfttiruhswsstc terto n eg nedo t frrsed odaheaa r tiiea eett uuynn oll luo hlfi
Random letters at the correct transition frequencies:
ivalylit ane orinleg oty n ive thayproma witire poth anotarefore t oyland te itisinckin fot t oprengingee cesprostho fls are t est sitisinivatrerdly orenagugiren min d fore wore re cialvesesiespr lis sexpog cas pr ofore glyarderreniore ind tinges ange t roreth ss orosionesefouly flva by uanored in teren m r ca f n wogils than ouesitorlisensthesthithe frevalve ofofilalin piorommalyan impespestinthol
sample_text (generic function with 1 method)
Exercise 1.4 - Language detection
It looks like we have a decent language model, in the sense that it understands transition frequencies in the language. In the demo above, try switching the language between English and Spanish - the generated text clearly looks more like one or the other, demonstrating that the model can capture differences between the two languages. What's remarkable is that our "training data" was just a single paragraph per language.
In this exercise, we will use our model to write a classifier: a program that automatically classifies a text as either English or Spanish.
This is not a difficult task - you could get dictionaries for both languages, and count matches - but we are doing something much cooler: we only use a single paragraph of each language, and we use a language model as a classifier.
Mystery sample
Enter some text here - we will detect which language it is written in!
"Small boats are typically found on inland waterways such as rivers and lakes, or in protected coastal areas. However, some boats, such as the whaleboat, were intended for use in an offshore environment. In modern naval terms, a boat is a vessel small enough to be carried aboard a ship. Anomalous definitions exist, as lake freighters 1,000 feet (300 m) long on the Great Lakes are called \"boats\". \n"
Let's compute the transition frequencies of our mystery sample! Type some text in the box below, and check whether the frequency matrix updates.
27×27 Matrix{Float64}:
0.0 0.00285714 0.0 … 0.0 0.00285714 0.0 0.00857143
0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.00857143 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0228571
0.00571429 0.00285714 0.00285714 0.00285714 0.0 0.0 0.0285714
0.0 0.0 0.0 … 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.00285714
⋮ ⋱ ⋮
0.00285714 0.0 0.0 0.0 0.0 0.0 0.0
0.00571429 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.00285714
0.0 0.0 0.0 … 0.0 0.0 0.0 0.0
0.0342857 0.0114286 0.00857143 0.0 0.0 0.0 0.0
Our model will compare the transition frequencies of our mystery sample to those of our two language samples. The closest match will be our detected language.
The only question left is: How do we compare two matrices? When two matrices are almost equal, but not exactly, we want to quantify their distance.
👉 Write a function called matrix_distance which takes 2 matrices of the same size and finds the distance between them by:
Subtracting corresponding elements
Finding the absolute value of each difference
Summing these absolute values
matrix_distance (generic function with 1 method)
missing
missing
Here we go!
Replace missing with your answer.
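One possible one-line solution, broadcasting the three steps over whole matrices:

matrix_distance(A, B) = sum(abs.(A - B))

matrix_distance([1 2; 3 4], [1 3; 2 4])   # 2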
We have written a cell that selects the language with the smallest distance to the mystery language. Make sure that matrix_distance is working correctly, and scroll up to the mystery text to see it in action!
Further reading
It turns out that the SVD of the transition matrix can mysteriously group the alphabet into vowels and consonants, without any extra information. See this paper if you want to try it yourself! We found that removing the space from alphabet (to match the paper) gave better results.
Exercise 2 - Language generation
Our model from Exercise 1 has the property that it can easily be 'reversed' to generate text. While this is useful to demonstrate its structure, the produced text is mostly meaningless: it fails to model words, let alone sentence structure.
To take our model one step further, we are going to generalize what we have done so far. Instead of looking at letter combinations, we will model word combinations. And instead of analyzing the frequencies of bigrams (combinations of two letters), we are going to analyze the frequencies of n-grams: sequences of n consecutive tokens.
Dataset
This also means that we are going to need a larger dataset to train our model on: the number of English words (and their combinations) is much higher than the number of letters.
We will train our model on the novel Emma (1815), by Jane Austen. This work is in the public domain, which means that we can download the whole book as a text file from archive.org. We have already downloaded and cleaned the text, and split it into word and punctuation tokens.
splitwords (generic function with 1 method)
"Emma"
"Woodhouse"
","
"handsome"
","
"clever"
","
"and"
"rich"
","
"with"
"a"
"com"
"-"
"fortable"
"home"
"and"
"happy"
"disposition"
","
"in"
"the"
"perfect"
"happiness"
"of"
"the"
"union"
"."
"THE"
"END"
"although"
"the"
"word"
"forest"
"is"
"commonly"
"used"
"there"
"is"
"no"
"universally"
"recognised"
"precise"
"definition"
"with"
"more"
"than"
"definitions"
"of"
"forest"
"etc"
"are"
"all"
"ultimately"
"derivations"
"of"
"the"
"french"
"word"
""
Exercise 2.1 - bigrams revisited
The goal of the upcoming exercises is to generalize what we have done in Exercise 1. To keep things simple, we split up our problem into smaller problems. (The solution to any computational problem.)
First, here is a function that takes an array, and returns the array of all neighbour pairs from the original. For example, bigrams([1, 2, 3, 42]) gives [[1,2], [2,3], [3,42]].
bigrams (generic function with 1 method)
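A sketch of such a function, using a comprehension over neighbouring indices:

bigrams(words) = [[words[i], words[i+1]] for i in 1:length(words)-1]

bigrams([1, 2, 3, 42])   # [[1, 2], [2, 3], [3, 42]]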
1
2
2
3
3
42
👉 Next, it's your turn to write a more general function ngrams that takes an array and a number n, and returns the array of all subsequences of n neighbouring elements. For example, ngrams([1, 2, 3, 42], 3) should give [[1,2,3], [2,3,42]], and ngrams([1, 2, 3, 42], 2) == bigrams([1, 2, 3, 42]).
ngrams (generic function with 1 method)
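One possible solution is to take every length-n slice of the array:

ngrams(words, n) = [words[i:i+n-1] for i in 1:length(words)-n+1]

ngrams([1, 2, 3, 42], 3)   # [[1, 2, 3], [2, 3, 42]]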
1
2
3
2
3
42
"although"
"the"
"word"
"forest"
"the"
"word"
"forest"
"is"
"word"
"forest"
"is"
"commonly"
"forest"
"is"
"commonly"
"used"
"is"
"commonly"
"used"
"there"
"commonly"
"used"
"there"
"is"
"used"
"there"
"is"
"no"
"there"
"is"
"no"
"universally"
"is"
"no"
"universally"
"recognised"
"no"
"universally"
"recognised"
"precise"
"universally"
"recognised"
"precise"
"definition"
"recognised"
"precise"
"definition"
"with"
"precise"
"definition"
"with"
"more"
"definition"
"with"
"more"
"than"
"with"
"more"
"than"
"definitions"
"more"
"than"
"definitions"
"of"
"than"
"definitions"
"of"
"forest"
"definitions"
"of"
"forest"
"used"
"of"
"forest"
"used"
"around"
"forest"
"used"
"around"
"the"
"and"
"portuguese"
"floresta"
"etc"
"portuguese"
"floresta"
"etc"
"are"
"floresta"
"etc"
"are"
"all"
"etc"
"are"
"all"
"ultimately"
"are"
"all"
"ultimately"
"derivations"
"all"
"ultimately"
"derivations"
"of"
"ultimately"
"derivations"
"of"
"the"
"derivations"
"of"
"the"
"french"
"of"
"the"
"french"
"word"
"the"
"french"
"word"
""
Got it!
Keep it up!
If you are stuck, you can write ngrams(words, n) = bigrams(words) (ignoring the true value of n), and continue with the other exercises.
Exercise 2.2 - frequency matrix revisited
In Exercise 1, we used a 2D array to store the bigram frequencies, where each column or row corresponds to a character from the alphabet. If we use trigrams, we could store the frequencies in a 3D array, and so on.
However, when counting words instead of letters, we run into a problem. A 3D array with one row, column and layer per word has too many elements to store on our computer.
Emma consists of 8465 unique words. This means that there are 606 billion possible trigrams - that's too much!
Although the frequency array would be very large, most entries are zero. For example, "Emma" is a common word, but "Emma Emma Emma" does not occur in the novel. This sparsity of non-zero entries can be used to store the same information in a more efficient structure.
Julia's built-in SparseArrays might sound like a logical choice, but these arrays only support 1D and 2D types, and we also want to directly index using strings, not just integers. So instead, we will use Dict: the dictionary type.
"vegetables"
"🌽"
"🎃"
"🍕"
"fruits"
"🍎"
"🍊"
"🍎"
"🍊"
(Did you notice something funny? The dictionary is unordered - this is why the entries were printed in a different order from how we defined them.)
You can dynamically add or change values of a Dict by assigning to my_dict[key]. You can check whether a key already exists using haskey(my_dict, key).
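For example (the dictionary here is made up for illustration):

prices = Dict("apple" => 3)
prices["pear"] = 5       # assigning to a new key adds an entry
haskey(prices, "pear")   # true
prices["apple"] += 1     # assigning to an existing key changes its value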
👉 Use these two techniques to write a function word_counts that takes an array of words, and returns a Dict with entries word => number_of_occurances.
For example:
word_counts(["to", "be", "or", "not", "to", "be"])
should return
Dict(
"to" => 2,
"be" => 2,
"or" => 1,
"not" => 1
)
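A sketch using exactly those two techniques:

function word_counts(words)
    counts = Dict()
    for word in words
        if haskey(counts, word)
            counts[word] += 1
        else
            counts[word] = 1
        end
    end
    counts
end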
word_counts (generic function with 1 method)
"or"
1
"not"
1
"to"
2
"be"
2
Got it!
Yay ❤
How many times does "Emma" occur in the book?
missing
Great! Let's get back to our ngrams. For the purpose of generating text, we are going to store a continuations cache. This is a dictionary where the keys are (n-1)-grams, and the values are lists of all the words that complete them to an ngram. For example:
let
trigrams = ngrams(split("to be or not to be that is the question", " "), 3)
cache = continuations_cache(trigrams)
cache == Dict(
["to", "be"] => ["or", "that"],
["be", "or"] => ["not"],
["or", "not"] => ["to"],
...
)
end
So for trigrams, our keys are the first 2 words of each trigram. If the same ngram occurs multiple times (e.g. "said Emma laughing"), then the last word ("laughing") should also be stored multiple times. This will allow us to generate trigrams with the same frequencies as the original text.
👉 Write the function continuations_cache, which takes an array of ngrams (i.e. an array of arrays of words, like the result of your ngrams function), and returns a dictionary as described above.
continuations_cache (generic function with 1 method)
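One possible approach: key each ngram on everything but its last word, and push that last word onto the list of completions (keeping duplicates, as required above):

function continuations_cache(grams)
    cache = Dict()
    for gram in grams
        key = gram[1:end-1]
        cache[key] = push!(get(cache, key, []), gram[end])
    end
    cache
end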
"or"
"not"
"to"
"to"
"be"
"or"
"that"
"be"
"or"
"not"
"not"
"to"
"be"
"be"
"that"
"is"
"is"
"the"
"question"
"that"
"is"
"the"
"cognates"
"in"
"romance"
"carolingian"
"scribes"
"first"
"latin"
"silva"
"which"
"portuguese"
"selva"
"the"
"floresta"
"etc"
"are"
"the"
"presence"
"of"
"expanse"
"covered"
"by"
"denote"
"the"
"royal"
"grow"
"trees"
"in"
"italian"
"spanish"
"and"
"is"
"commonly"
"used"
"many"
"definitions"
"an"
"completely"
"lacking"
"trees"
"royal"
"hunting"
"grounds"
"old"
"high"
"german"
"definition"
"with"
"more"
"used"
"there"
"is"
"the"
"medieval"
"latin"
"fores"
"denoting"
"forest"
"forest"
"and"
"woodland"
"into"
"english"
"as"
"as"
"the"
"word"
"past"
"will"
"grow"
"forest"
"was"
"first"
"more"
"than"
"definitions"
"or"
"old"
"high"
"introduced"
"into"
"english"
"wood"
"carolingian"
"scribes"
"foresta"
"spanish"
"and"
"of"
"trees"
"under"
Exercise 2.4 - write a novel
We have everything we need to generate our own novel! The final step is to sample random ngrams, in a way that each next ngram overlaps with the previous one. We've done this in the function generate_from_ngrams below - feel free to look through the code, or to implement your own version.
generate_from_ngrams(grams, num_words)
Given an array of ngrams (i.e. an array of arrays of words), generate a sequence of num_words words by sampling random ngrams.
Compute the ngrams of an array of words, but add the first n-1 at the end, to ensure that every ngram ends in the beginning of another ngram.
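A sketch of this wrap-around variant (the name ngrams_circular is ours), assuming the ngrams function from Exercise 2.1:

ngrams_circular(words, n) = ngrams(vcat(words, words[1:n-1]), n)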
generate(source_text::AbstractString, num_token; n=3, use_words=true)
Given a source text, generate a String that "looks like" the original text by satisfying the same ngram frequency distribution as the original.
Interactive demo
Enter your own text in the box below, and use that as training data to generate anything!
Using letters:
uoo eisueanwcesfoah e aetdmriosa tlreutoonhhn o rue s ct iigdlrtmrs ug ehoomirs segnttwv vnlvebthoonepigoprletrl fsesnteeeeeaoagrha ataeder g tvhi nliovelugtvanrroeo rhema oib ess seoedo a eteneso ds er ahet aitdgiis rtsbv r ee vnoiltd soiiih nfgtyncwodd la e obflkeo ra otaupavmdnt o rrtnfwhiw vetusc lile iyf rlachtoeldceiodveoofir s rg tniwrnldftawnrs rrrteas ydoooeles tlnhad sdsm nltcc
Using words:
forest trees word which high in universally as although the of was from covered the into french frankish the than possibly spanish land used no native grew definitions derived of old g regardless e may if world romance the trees is it recognised via ultimately is french of forest the g for of may the grow sylvan g regardless confer of latin wild commonly borrowing italian word typethe french royal of romanian precise no of also land grounds necessity via probably and having is is confer trees king legally charlemagne latin there may silv via may capitularies expanse for
Automatic Jane Austen
Uncomment the cell below to generate some Jane Austen text:
at least very near , with Miss Woodhouse , who had always meant to try to please — but these sort of visits conveyed , he could make me more than she ought to have 250 EMMA held such a man that a daughter , who seemed so perfectly comfortable without a little gruel to you that , much as they walked on . 5 He listened with the same . What ! think a young lady ; a simple question on that head ; but it was not mixed ; and was re - stored to the resolution of
"Emma"
"Woodhouse"
","
"handsome"
","
"clever"
","
"and"
"rich"
","
"with"
"a"
"com"
"-"
"fortable"
"home"
"and"
"happy"
"disposition"
","
"in"
"the"
"perfect"
"happiness"
"of"
"the"
"union"
"."
"THE"
"END"
"Emma"
"Woodhouse"
","
"Woodhouse"
","
"handsome"
","
"handsome"
","
"handsome"
","
"clever"
","
"clever"
","
"clever"
","
"and"
","
"and"
"rich"
"and"
"rich"
","
"rich"
","
"with"
","
"with"
"a"
"with"
"a"
"com"
"a"
"com"
"-"
"com"
"-"
"fortable"
"-"
"fortable"
"home"
"fortable"
"home"
"and"
"home"
"and"
"happy"
"and"
"happy"
"disposition"
"happy"
"disposition"
","
"disposition"
","
"seemed"
","
"seemed"
"to"
"in"
"the"
"perfect"
"the"
"perfect"
"happiness"
"perfect"
"happiness"
"of"
"happiness"
"of"
"the"
"of"
"the"
"union"
"the"
"union"
"."
"union"
"."
"THE"
"."
"THE"
"END"
"THE"
"END"
"Emma"
"END"
"Emma"
"Woodhouse"
"be"
"out"
"of"
","
"of"
"of"
"of"
"of"
"great"
"pity"
"that"
"indeed"
";"
"assembled"
"there"
","
"she"
"asked"
"."
"me"
","
"could"
"wish"
".'"
","
"to"
"for"
".'"
"."
"for"
"."
"to"
"."
"to"
"long"
"in"
"the"
"equal"
"the"
"coming"
"want"
"reaching"
"),"
"he"
"would"
"introducing"
"Robert"
"Martin"
"had"
"better"
"not"
"say"
"take"
"let"
"not"
"look"
"order"
"come"
"be"
"sense"
"our"
"visit"
"?"
"been"
"waiting"
"the"
"thought"
"them"
"the"
"all"
"a"
"do"
"good"
"to"
"am"
"of"
"Hartfield"
"being"
"My"
"representa"
"-"
"doubt"
"he"
"is"
"was"
"would"
"did"
"only"
"wait"
"till"
"bear"
"her"
"well"
"grate"
"-"
"ful"
"environs"
"could"
"be"
"expense"
"or"
"inconvenience"
"Cole"
"—"
"he"
"merriment"
"it"
"would"
"with"
"astonishment"
"at"
"Till"
"you"
"chose"
"s"
"thoughts"
"all"
"from"
"a"
","
"arrived"
"from"
"Broadwood"
"Mr"
"commend"
"a"
"baked"
"be"
"spoken"
"to"
"oppress"
"her"
".'"
"Emma" – rearranged
must of the
Choose the next word!
deduplicate_and_sort_by_occurance (generic function with 1 method)
Function library
Just some helper functions used in the notebook.
Quote (generic function with 1 method)
show_pair_frequencies (generic function with 1 method)
compimg (generic function with 2 methods)
hint (generic function with 1 method)
almost (generic function with 1 method)
still_missing (generic function with 2 methods)
keep_working (generic function with 2 methods)
Fantastic!
Splendid!
Great!
Yay ❤
Great! 🎉
Well done!
Keep it up!
Good job!
Awesome!
You got the right answer!
Let's move on to the next section.
correct (generic function with 2 methods)
not_defined (generic function with 1 method)